[Databricks] Set up a workspace
In this section, we will create a Databricks workspace to work with a data lake and PySpark.
Create a Databricks workspace
First, you need to sign in to your Azure account.
From the Azure homepage, search for Azure Databricks in the search bar. Then click on Create. You will be asked to fill in the following information:
- Subscription: you should select your school subscription
- Resource group: create a new resource group where your workspace will be stored
- Workspace name: choose a name for your workspace
- Region: select the region for your workspace; here, choose West Europe
- Pricing tier: select the pricing plan you want; here, choose Trial
Once the fields are filled in, you can review and create your Databricks workspace. The deployment takes several minutes, after which you can access your Databricks workspace!
Create and configure a Databricks cluster
A Databricks cluster is the compute resource that runs your code on Databricks. Depending on the cloud provider, it is backed by virtual machines with the configuration you choose.
Go to the Compute section and Create compute with the following configuration (a REST API equivalent is sketched after this list):
- Single node
- Uncheck Use Photon Acceleration
- Node type: Standard_DS3_v2
- Terminate after 60 minutes of inactivity
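If you prefer to script this step, the same configuration can be submitted to the Databricks Clusters REST API. The sketch below is a minimal example, not the exact portal flow: the workspace URL, token, and spark_version value are placeholders/assumptions you must replace with values from your own workspace.

import requests

# Assumptions: replace with your own workspace URL and personal access token.
DATABRICKS_HOST = 'https://adb-1234567890123456.7.azuredatabricks.net'
TOKEN = 'YOUR_PERSONAL_ACCESS_TOKEN'

# Single-node cluster: num_workers=0 plus the singleNode profile and tag
cluster_spec = {
    'cluster_name': 'my-single-node-cluster',
    'spark_version': '13.3.x-scala2.12',  # assumption: pick a runtime listed in your workspace
    'node_type_id': 'Standard_DS3_v2',
    'num_workers': 0,
    'autotermination_minutes': 60,  # terminate after 60 minutes of inactivity
    'spark_conf': {
        'spark.databricks.cluster.profile': 'singleNode',
        'spark.master': 'local[*]',
    },
    'custom_tags': {'ResourceClass': 'SingleNode'},
}

# Photon stays off unless explicitly enabled in the spec
resp = requests.post(
    DATABRICKS_HOST + '/api/2.0/clusters/create',
    headers={'Authorization': 'Bearer ' + TOKEN},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id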
Your cluster will start and will appear under Compute>All-purpose compute. Click on your newly created cluster, then go to the Libraries section.
Click on Install New, select PyPI as the Library Source, and enter the package you wish to install on the cluster in the Package field. You can pin a specific version of the package with the syntax package==1.0.3.
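Alternatively, you can install a package from within a notebook itself with the %pip magic command; such an install is scoped to the notebook session rather than the whole cluster. The package and version below are placeholders only.

# Notebook-scoped install: run this in its own cell at the top of the notebook
%pip install requests==2.31.0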
Run your code on Databricks
With the environment set up, we can try to run some code in a Databricks notebook. Go to Workspace>Workspace>Add>Notebook. We will use the same Python code example as on the API basics page:
import requests
import pandas as pd

# Show all columns when displaying DataFrames
pd.set_option('display.max_columns', 500)

# Set your Alpha Vantage API key
AV_API_Key = 'YOURAPIKEY'
search_keyword = 'meta'

# Build the SYMBOL_SEARCH request URL
url = 'https://www.alphavantage.co/query?function=SYMBOL_SEARCH&keywords={searchKeyword}&apikey={apiKey}'.format(apiKey=AV_API_Key, searchKeyword=search_keyword)

# Call the API and parse the JSON response
r_search = requests.get(url)
js_search = r_search.json()

# Load the result into a DataFrame and render it with Databricks' display()
df_search = pd.DataFrame(js_search)
display(df_search)
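Note that Alpha Vantage nests the search results under a bestMatches key, so the DataFrame above holds raw dictionaries in a single column. A minimal sketch to flatten the matches into one column per field, assuming the response follows that documented shape:

# Flatten the list of matches under 'bestMatches' into a tabular DataFrame
df_matches = pd.json_normalize(js_search.get('bestMatches', []))
display(df_matches)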
Well done, you can now use Databricks to run Python code!
Tip
If you want to go further, you can check my blog post on how Web Scrape data with Databricks: How to use Selenium package on Databricks.
Clone a GitHub repository to Databricks
Connect your GitHub account to your Databricks workspace
Click on your email address at the top right, then User Settings>Linked Account. Set the Git provider to GitHub and select Link Git account to link your GitHub account to your Databricks workspace.
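The same link can be created programmatically through the Databricks Git Credentials REST API. A minimal sketch, assuming a workspace URL, a Databricks personal access token, and a GitHub personal access token you have already generated (all placeholders):

import requests

# Assumptions: replace with your own workspace URL and tokens.
DATABRICKS_HOST = 'https://adb-1234567890123456.7.azuredatabricks.net'
DATABRICKS_TOKEN = 'YOUR_DATABRICKS_TOKEN'

# Register a GitHub personal access token as a Git credential
resp = requests.post(
    DATABRICKS_HOST + '/api/2.0/git-credentials',
    headers={'Authorization': 'Bearer ' + DATABRICKS_TOKEN},
    json={
        'git_provider': 'gitHub',
        'git_username': 'your-github-username',
        'personal_access_token': 'YOUR_GITHUB_TOKEN',
    },
)
resp.raise_for_status()
print(resp.json())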
Clone your repository
Go to Workspace>Repos>Add repo
You will then be asked for the following information before the repository is created:
- Git repository URL: the URL available on GitHub under Code>HTTPS for your repository.
- Git provider: GitHub
- Repository name: Name of the Databricks repository
Tip
Repository information (Git provider and repository name) will be filled in automatically from the link pasted in the Git repository URL field.
To check that your repository was cloned correctly, you should find under Repos>YourMailAddress>YourRepositoryName the README file initially created, or any other file available in your repository.
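Cloning can also be scripted through the Repos REST API. A minimal sketch, assuming the same placeholder workspace URL and token as above and a hypothetical repository:

import requests

# Assumptions: replace with your own workspace URL, token, and repository.
DATABRICKS_HOST = 'https://adb-1234567890123456.7.azuredatabricks.net'
DATABRICKS_TOKEN = 'YOUR_DATABRICKS_TOKEN'

# Clone a GitHub repository into the Repos section of the workspace
resp = requests.post(
    DATABRICKS_HOST + '/api/2.0/repos',
    headers={'Authorization': 'Bearer ' + DATABRICKS_TOKEN},
    json={
        'url': 'https://github.com/your-user/your-repo.git',
        'provider': 'gitHub',
        'path': '/Repos/you@example.com/your-repo',
    },
)
resp.raise_for_status()
print(resp.json())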
Upgrade Databricks workspace to premium
At the end of your Databricks trial period, you will need to upgrade your workspace to Premium. From the Azure portal, go to your Azure Databricks resource and select Upgrade to Premium>Upgrade. Wait a few minutes for the change to be applied.